Abstract —

In 1991, Ballard [1] described the implications of having a visual system
that could actively position the camera coordinates in response to physical
stimuli. In humanoid robotic systems, or in any animate vision system that
interacts with people, social dynamics provide additional levels of
constraint and provide additional opportunities for processing economy. In
this paper, we describe an integrated visual-motor system that has been
implemented on a humanoid robot to negotiate the robot’s physical
constraints, the perceptual needs of the robot’s behavioral and motivational
systems, and the social implications of motor acts.

Keywords —  Active  vision,  robots,  social  interaction  with 
humans. 

I.  Introduction 

Animate vision introduces requirements for real-time processing, removes
simplifying assumptions of static camera systems, and presents opportunities
for simplifying computation. This simplification arises from situating
perception in a behavioral context, from providing opportunities to learn
flexible behaviors, and from allowing the exploitation of dynamic
regularities of the environment [1]. These benefits have been of critical
interest to a variety of humanoid robotics projects [2], [3], [4], [5], and
to the robotics and AI communities as a whole. On a practical level, the
vast majority of these systems are still limited by the complexities of
perception and thus focus on a single aspect of animate vision or
concentrate on the integration of two well-known systems. On a theoretical
level, existing systems often do not benefit from the advantages that
Ballard proposed because of their limited scope.

In humanoid robotics, these problems are particularly evident. Animate
vision systems that provide only a limited set of behaviors (such as
supporting only smooth pursuit tracking) or that provide behaviors on
extremely limited perceptual inputs (such as systems that track only very
bright light sources) fail to provide a natural interaction between human
and robot. We propose that in order to allow realistic human-machine
interactions, an animate vision system must address a set of social
constraints in addition to the other issues that classical active vision has
addressed.

It is useful to view social constraints not as limitations but as
opportunities. They induce a natural “vocabulary” to make the robot’s
behavior and state readable by a human. This in turn can provide a framework
for the robot to negotiate a change in a human’s behavior. For example,
Section XI-A discusses a procedure the robot can use to control the
“inter-personal” distance a human assigns to it. Having this control means
that in social situations the robot can get quite far with a simple vision
system that is tuned to a particular distance.

Artificial Intelligence Laboratory, Massachusetts Institute of Technology
(MIT), Cambridge, MA. E-mail: Cynthia, edsinger, paulfitz, scaz@ai.mit.edu

II.  Social  constraints 

For robots and humans to interact meaningfully, it is important that they
understand each other enough to be able to shape each other’s behavior. This
has several implications. One of the most basic is that robot and human
should have at least some overlapping perceptual abilities. Otherwise, they
can have little idea of what the other is sensing and responding to. Vision
is one important sensory modality for human interaction, and the one we
focus on in this article. We endow our robots with visual perception that is
human-like in its physical implementation.


Fig. 1. Kismet, a robot capable of conveying intentionality through facial
expressions and behavior. Here, the robot’s physical state expresses
attention to and interest in the human beside it. Another person - for
example, the photographer - would expect to have to attract the robot’s
attention before being able to influence its behavior.

Similarity of perception requires more than similarity of sensors. Not all
sensed stimuli are equally behaviorally relevant. It is important that both
human and robot find the same types of stimuli salient in similar
conditions. Our robots have a set of perceptual biases based on the human
pre-attentive visual system. These biases can be modulated by the
motivational state of the robot, making later perceptual stages more
behaviorally relevant. This approximates the top-down influence of
motivation on the bottom-up pre-attentive process found in human vision.

Visual perception requires high bandwidth and is computationally demanding.
In the early stages of human vision, the entire visual field is processed in
parallel. Later computational steps are applied much more selectively, so
that behaviorally relevant parts of the visual field can be processed in
greater detail. This mechanism of visual attention is just as important for
robots as it is for humans, from the same considerations of resource
allocation. The existence of visual attention is also key to satisfying the
expectations of humans concerning what can and cannot be perceived visually.
We have implemented a context-dependent attention system that goes some way
towards this.

Human eye movements have a high communicative value. For example, gaze
direction is a good indicator of the locus of visual attention. Knowing a
person’s locus of attention reveals what that person currently considers
behaviorally relevant, which is in turn a powerful clue to their intent. The
dynamic aspects of eye movement, such as staring versus glancing, also
convey information. Eye movements are particularly potent during social
interactions, such as conversational turn-taking, where making and breaking
eye contact plays an important role in regulating the exchange. We model the
eye movements of our robots after humans, so that they may have similar
communicative value.

Our hope is that by following the example of the human visual system, the
robot’s behavior will be easily understood because it is analogous to the
behavior of a human in similar circumstances (see Figure 1). For example,
when an anthropomorphic robot moves its eyes and neck to orient toward an
object, an observer can effortlessly conclude that the robot has become
interested in that object. These traits not only lead to behavior that is
easy to understand but also allow the robot’s behavior to fit into the
social norms that the person expects.

There are other advantages to modeling our implementation after the human
visual system. There is a wealth of data and proposed models for how the
human visual system is organized. This data provides not only a modular
decomposition but also mechanisms for evaluating the performance of the
complete system. Another advantage is robustness. A system that integrates
action, perception, attention, and other cognitive capabilities can be more
flexible and reliable than a system that focuses on only one of these
aspects. Adding perceptual capabilities and additional constraints between
behavioral and perceptual modules can increase the relevance of behaviors
while limiting the computational requirements [6]. For example, in
isolation, two difficult problems for a visual tracking system are knowing
what to track and knowing when to switch to a new target. These problems can
be simplified by combining the tracker with a visual attention system that
can identify objects that are behaviorally relevant and worth tracking. In
addition, the tracking system benefits the attention system by maintaining
the object of interest in the center of the visual field. This simplifies
the computation necessary to implement behavioral habituation. The two
modules work in concert, each compensating for the deficiencies of the other
and limiting the computation the other requires.

III.  Physical  form 

Currently, the most sophisticated of our robots in terms of visual-motor
behavior is Kismet. This robot is an active vision head augmented with
expressive facial features (see Figure 2). Kismet is designed to receive and
send human-like social cues to a caregiver, who can regulate its environment
and shape its experiences as a parent would for a child. Kismet has three
degrees of freedom to control gaze direction, three degrees of freedom to
control its neck, and fifteen degrees of freedom in other expressive
components of the face (such as ears and eyelids). To perceive its
caregiver, Kismet uses a microphone, worn by the caregiver, and four color
CCD cameras. The positions of the neck and eyes are important both for
expressive postures and for directing the cameras towards behaviorally
relevant stimuli.

Fig. 2. Kismet has a large set of expressive features - eyelids, eyebrows,
ears, jaw, lips, neck and eye orientation. The schematic on the right shows
the degrees of freedom relevant to visual perception (omitting the
eyelids!). The eyes can turn independently along the horizontal (pan), but
turn together along the vertical (tilt). The neck can turn the whole head
horizontally and vertically, and can also crane forward. Two cameras with
narrow “foveal” fields of view rotate with the eyes. Two central cameras
with wide fields of view rotate with the neck. These cameras are unaffected
by the orientation of the eyes.

The cameras in Kismet’s eyes have high acuity but a narrow field of view.
Between the eyes, there are two unobtrusive central cameras fixed with
respect to the head, each with a wider field of view but correspondingly
lower acuity. The reason for this mixture of cameras is that typical visual
tasks require both high acuity and a wide field of view. High acuity is
needed for recognition tasks and for controlling precise visually guided
motor movements. A wide field of view is needed for search tasks, for
tracking multiple objects, compensating for involuntary ego-motion, etc. A
common trade-off found in biological systems is to sample part of the visual
field at a high enough resolution to support the first set of tasks, and to
sample the rest of the field at an adequate level to support the second set.
This is seen in animals with foveate vision, such as humans, where the
density of photoreceptors is highest at the center and falls off
dramatically towards the periphery. This can be implemented by using
specially designed imaging hardware [7], space-variant image sampling [8],
or by using multiple cameras with different fields of view, as we have done.

The designs of our robots are constantly evolving. New degrees of freedom
are added, old degrees of freedom are reorganized, sensors are replaced or
rearranged, and new sensory modalities are introduced. The descriptions
given here should be treated as a fleeting snapshot of the current state of
the robots. Our hardware and software control architectures have been
designed to meet the challenge of real-time processing of visual signals
(approaching 30 Hz) with minimal latencies. Kismet’s vision system is
implemented on a network of nine 400 MHz commercial PCs running the QNX
real-time operating system. Kismet’s motivational system runs on a
collection of four Motorola 68332 processors. Machines running Windows NT
and Linux are also networked for speech generation and recognition
respectively. Even more so than Kismet’s physical form, the control network
is rapidly evolving as new behaviors and sensory modalities come on line.

IEEE TRANSACTIONS ON MAN, CYBERNETICS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2000

IV.  Levels  of  visual  behavior 

Visual behavior can be conceptualized on four different levels (as shown in
Figure 3). These levels correspond to the social level, the behavior level,
the skills level, and the primitives level. This decomposition is motivated
by distinct temporal, perceptual, and interaction constraints that exist at
each level. The temporal constraints pertain to how fast the motor acts must
be updated and executed. These can range from real-time vision rates (30 Hz)
to the relatively slow time scale of social interaction (potentially
transitioning over minutes). The perceptual constraints pertain to what
level of sensory feedback is required to coordinate behavior at that layer.
This perceptual feedback can range from low-level visual processes, such as
the current target from the attention system, to relatively high-level
multi-modal percepts generated by the behavioral releasers. The interaction
constraints pertain to the arbitration of units that compose each layer.
This can range from low-level oculomotor primitives (such as saccades and
smooth pursuit) to using visual behavior to regulate human-robot
turn-taking.


Fig. 3. Levels of behavioral organization. The primitive level is populated
with tightly coupled sensorimotor loops. The skill level contains modules
that coordinate primitives to achieve tasks. Behavior level modules deal
with questions of relevance, persistence and opportunism in the arbitration
of tasks. The social level comprises design-time considerations of how the
robot’s behaviors will be interpreted and responded to in a social
environment.

Each level serves a particular purpose for generating the overall observed
behavior. As such, each level must address a specific set of issues. The
levels of abstraction help simplify the overall control of visual behavior
by restricting each level to address those core issues that are best managed
at that level. By doing so, the coordination of visual behavior at each
level (i.e., arbitration), between the levels (i.e., top-down and
bottom-up), and through the world is maintained in a principled way.

•  The social level: The social level explicitly deals with issues
pertaining to having a human in the interaction loop. This requires careful
consideration of how the human interprets and responds to the robot’s
behavior in a social context. Using visual behavior (making eye contact and
breaking eye contact) to help regulate the transition of speaker turns
during vocal turn-taking is an example.

•  The behavior level: The behavior level deals with issues related to
producing relevant, appropriately persistent, and opportunistic behavior.
This involves arbitrating between the many possible goal-achieving behaviors
that the robot could perform to establish the current task. Actively seeking
out a desired stimulus and then visually engaging it is an example.

•  The motor skill level: The motor skill level is responsible for figuring
out how to move the motors to accomplish that task. Fundamentally, this
level deals with the issues of blending of and sequencing between
coordinated ensembles of motor primitives (each ensemble is a distinct motor
skill). The skills level must also deal with coordinating multi-modal motor
skills (e.g., those motor skills that combine speech, facial expression, and
body posture). A fixed action pattern such as a searching behavior is an
example, in which the robot alternates ballistic eye-neck orientation
movements with gaze fixation on the most salient target. The ballistic
movements are important for scanning the scene, and the fixation periods are
important for locking on the desired type of stimulus.

•  The motor primitives level: The motor primitives level implements the
building blocks of motor action. This level must deal with motor resource
allocation and tightly coupled sensorimotor loops. For example, gaze
stabilization must take sensory stimuli and produce motor commands in a very
tight feedback loop. Kismet actually has four distinct motor systems at the
primitives level: the affective vocal system, the facial expression system,
the oculomotor system, and the body posturing system. Because this paper
focuses on visual behavior, we only discuss the oculomotor system here.

We describe these levels in detail as they pertain to Kismet’s visual
behavior. We begin at the lowest level, motor primitives pertaining to
vision, and progress to the highest level where we discuss the social
constraints of animate vision.

V.  Visual  motor  primitives 

Kismet’s visual-motor control is modeled after the human ocular-motor
system. The human system is so good at providing a stable percept of the
world that we have no intuitive appreciation of the physical constraints
under which it operates. Humans have foveate vision. The fovea (the center
of the retina) has a much higher density of photoreceptors than the
periphery. This means that to see an object clearly, humans must move their
eyes such that the image of the object falls on the fovea. Human eye
movement is not smooth. It is composed of many quick jumps, called saccades,
which rapidly re-orient the eye to project a different part of the visual
scene onto the fovea. After a saccade, there is typically a period of
fixation, during which the eyes are relatively stable. They are by no means
stationary, and continue to engage in corrective micro-saccades and other
small movements. If the eyes fixate on a moving object, they can follow it
with a continuous tracking movement called smooth pursuit. This type of eye
movement cannot be evoked voluntarily, but only occurs in the presence of a
moving object. Periods of fixation typically end after some hundreds of
milliseconds, after which a new saccade will occur [9].

The eyes normally move in lock-step, making equal, conjunctive movements.
For a close object, the eyes need to turn towards each other somewhat to
correctly image the object on the foveae of the two eyes. These disjunctive
movements are called vergence, and rely on depth perception (see Figure 4).
Since the eyes are located on the head, they need to compensate for any head
movements that occur during fixation. The vestibulo-ocular reflex uses
inertial feedback from the vestibular system to keep the orientation of the
eyes stable as the head moves. This is a very fast response, but is prone to
the accumulation of error over time. The opto-kinetic response is a slower
compensation mechanism that uses a measure of the visual slip of the image
across the retina to correct for drift. These two mechanisms work together
to give humans stable gaze as the head moves.

Fig. 4. Humans exhibit four characteristic types of eye motion. Saccadic
movements are high-speed ballistic motions that center a target in the field
of view. Smooth pursuit movements are used to track a moving object at low
velocities. The vestibulo-ocular and opto-kinetic reflexes act to maintain
the angle of gaze as the head and body move through the world. Vergence
movements serve to maintain an object in the center of the field of view of
both eyes as the object moves in depth.

Our implementation of an ocular-motor system is an approximation of the
human system. The motor primitives are organized around the needs of higher
levels, such as maintaining and breaking mutual regard, performing visual
search, etc. Since our motor primitives are tightly bound to visual
attention, we will first discuss their sensory component.

VI.  Pre-attentive  visual  perception

Human infants and adults naturally find certain perceptual features
interesting. Features such as color, motion, and face-like shapes are very
likely to attract our attention [10]. We have implemented a variety of
perceptual feature detectors that are particularly relevant to interacting
with people and objects. These include low-level feature detectors attuned
to quickly moving objects, highly saturated color, and colors representative
of skin tones. Examples of features we have used are shown in Figure 5.
Looming objects are also detected pre-attentively, to facilitate a fast
reflexive withdrawal.

Fig. 5. Overview of the attention system. The robot’s attention is
determined by a combination of low-level perceptual stimuli. The relative
weightings of the stimuli are modulated by high-level behavior and
motivational influences. A sufficiently salient stimulus in any modality can
pre-empt attention, similar to the human response to sudden motion. All else
being equal, larger objects are considered more salient than smaller ones.
The design is intended to keep the robot responsive to unexpected events,
while avoiding making it a slave to every whim of its environment. With this
model, people intuitively provide the right cues to direct the robot’s
attention (shake object, move closer, wave hand, etc.). Displayed images
were captured during a behavioral trial session.

A.  Color  saliency  feature  map

One of the most basic and widely recognized visual features is color. Our
models of color saliency are drawn from the complementary work on visual
search and attention from Itti, Koch, and Niebur [11]. The incoming video
stream contains three 8-bit color channels (r, g, and b) which are
transformed into four color-opponency channels (r', g', b', and y'). Each
input color channel is first normalized by the luminance l (a weighted
average of the three input color channels):


\[
r_n = \frac{255}{3} \cdot \frac{r}{l}, \qquad
g_n = \frac{255}{3} \cdot \frac{g}{l}, \qquad
b_n = \frac{255}{3} \cdot \frac{b}{l}
\tag{1}
\]


These normalized color channels are then used to produce four opponent-color
channels:

\begin{align}
r' &= r_n - (g_n + b_n)/2 \tag{2} \\
g' &= g_n - (r_n + b_n)/2 \tag{3} \\
b' &= b_n - (r_n + g_n)/2 \tag{4} \\
y' &= \frac{r_n + g_n}{2} - b_n - \| r_n - g_n \| \tag{5}
\end{align}

The four opponent-color channels are clamped to 8-bit values by
thresholding. While some research seems to indicate that each color channel
should be considered individually [10], we choose to maintain all of the
color information in a single feature map to simplify the processing
requirements (as does Wolfe [12] for more theoretical reasons).
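The normalization and opponency steps above can be sketched per pixel as follows. This is a minimal illustration, not the robot’s implementation: the text only says the luminance is “a weighted average” of the input channels, so the equal weighting used here is an assumption, as are the function name and the handling of black pixels.

```python
def opponent_channels(r, g, b):
    """Map one 8-bit RGB pixel to clamped (r', g', b', y') opponent values."""
    # Assumption: equal-weight luminance; the actual weights are not given.
    lum = (r + g + b) / 3.0
    if lum == 0:
        # Assumption: a black pixel carries no color information.
        return (0, 0, 0, 0)

    scale = 255.0 / 3.0
    rn, gn, bn = scale * r / lum, scale * g / lum, scale * b / lum

    rp = rn - (gn + bn) / 2
    gp = gn - (rn + bn) / 2
    bp = bn - (rn + gn) / 2
    yp = (rn + gn) / 2 - bn - abs(rn - gn)

    def clamp(v):
        # Clamp to the 8-bit range, as the text describes.
        return int(max(0, min(255, round(v))))

    return tuple(clamp(v) for v in (rp, gp, bp, yp))


print(opponent_channels(255, 0, 0))      # saturated red
print(opponent_channels(100, 100, 100))  # neutral gray
```

Under these assumptions a saturated red pixel drives only the r' channel, while a gray pixel produces no response in any channel, which is the behavior one would want from a saturation-sensitive color map.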


B.  Motion  feature  map 

In  parallel  with  the  color  saliency  computations,  a  second 
processor  receives  input  images  from  the  frame  grabber  and 
computes  temporal  differences  to  detect  motion.  Motion 
detection  is  performed  on  the  wide  FoV  camera,  which  is 
often  at  rest  since  it  does  not  move  with  the  eyes.  The 
incoming  image  is  converted  to  grayscale  and  placed  into  a 
ring  of  frame  buffers.  A  raw  motion  map  is  computed  by 
passing  the  absolute  difference  between  consecutive  images 
through  a  threshold  function  T : 


\[
M_{raw} = T(\| I_t - I_{t-1} \|)
\tag{6}
\]

This raw motion map is then smoothed with a uniform 7x8 field. The result is
a binary 2-D map where regions corresponding to motion have a high intensity
value. The motion saliency feature map is computed at 25-30 Hz by a single
400 MHz processor node. Figure 5 gives an example of the motion feature map
when the robot looks at a toy block that is being shaken.
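As a rough sketch of this pipeline, the code below thresholds the absolute difference of two grayscale frames and then smooths the result with a uniform window. The function names, the threshold value, and the border handling are assumptions made for illustration; the text does not specify them, and a small 3x3 window is used in the example only to keep it readable.

```python
def raw_motion_map(prev, curr, threshold):
    """Thresholded absolute difference of two grayscale frames (lists of rows)."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def box_smooth(img, kh, kw):
    """Average each pixel over a kh x kw window, clipped at the image border."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - (kh - 1) // 2), min(h, y + kh // 2 + 1))
            xs = range(max(0, x - (kw - 1) // 2), min(w, x + kw // 2 + 1))
            vals = [img[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


# Two 5x5 frames: one pixel changes strongly, one changes only slightly.
previous = [[0] * 5 for _ in range(5)]
current = [[0] * 5 for _ in range(5)]
current[2][2] = 200   # real motion
current[0][0] = 5     # sensor noise, below the assumed threshold

motion = box_smooth(raw_motion_map(previous, current, threshold=16), 3, 3)
```

The threshold suppresses small frame-to-frame intensity changes, and the smoothing spreads isolated responses so that coherent moving regions dominate the map.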

C.  Skin  tone  feature  map 

We also filter for colors consistent with skin. This is a computationally
inexpensive means to rule out regions that are unlikely to contain faces or
hands. Most pixels on faces will pass these tests over a wide range of
lighting conditions and skin color. Pixels that pass these tests are
weighted according to a function learned from instances of skin tone from
images taken by Kismet’s cameras (see Figure 6). In this implementation, a
pixel is not skin-toned if:

•  r < 1.1 · g (the red component fails to dominate green sufficiently)

•  r < 0.9 · b (the red component is excessively dominated by blue)

•  r > 2.0 · max(g, b) (the red component completely dominates both blue and
green)

•  r < 20 (the red component is too low to give good estimates of ratios)

•  r > 250 (the red component is too saturated to give a good estimate of
ratios)
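The five rejection tests translate directly into a predicate on a single 8-bit pixel. The sketch below covers only this pass/fail stage; the learned weighting function applied to passing pixels is not described in enough detail to reproduce, and the function name is hypothetical.

```python
def passes_skin_tone_tests(r, g, b):
    """Return True if an 8-bit RGB pixel survives all five rejection tests."""
    if r < 1.1 * g:            # red fails to dominate green sufficiently
        return False
    if r < 0.9 * b:            # red is excessively dominated by blue
        return False
    if r > 2.0 * max(g, b):    # red completely dominates both blue and green
        return False
    if r < 20:                 # red too low to give good ratio estimates
        return False
    if r > 250:                # red too saturated to give good ratio estimates
        return False
    return True


print(passes_skin_tone_tests(200, 150, 120))  # True
print(passes_skin_tone_tests(100, 100, 100))  # False: gray, red does not dominate
```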

Fig. 6. The skin tone filter responds to 4.7% of possible (R, G, B) values.
Each grid in the figure to the left shows the response of the filter to all
values of red and green for a fixed value of blue. The image to the right
shows the filter in operation. Typical indoor objects that may also be
consistent with skin tone include wooden doors, cream walls, etc.

VII.  Visual  attention

We have implemented Wolfe’s model of human visual search and attention [12].
Our implementation is similar to other models based in part on Wolfe’s work
[11], but additionally operates in conjunction with motivational and
behavioral models, with moving cameras, and addresses the issue of
habituation. The attention process acts in two parts. The low-level feature
detectors discussed in the previous section are combined through a weighted
average to produce a single attention map. This combination allows the robot
to select regions that are visually salient and to direct its computational
and behavioral resources towards those regions. The attention system also
integrates influences from the robot’s internal motivational and behavioral
systems to bias the selection process. For example, if the robot’s current
goal is to interact with people, the attention system is biased toward
objects that have colors consistent with skin tone. The attention system
also has mechanisms for habituating to stimuli, thus providing the robot
with a primitive attention span. The state of the attention system is
usually reflected in the robot’s gaze direction, unless there are behavioral
reasons for this not to be the case. The attention system runs all the time,
even when it is not controlling gaze, since it determines the perceptual
input to which the motivational and behavioral systems respond.
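The weighted-average combination can be illustrated with a short sketch. It assumes that all feature maps share the same dimensions and that the most salient location is simply the maximum of the combined map; the text does not state how a target region is extracted, so that selection rule, like the function names and the example values, is an assumption.

```python
def attention_map(feature_maps, weights):
    """Per-pixel weighted average of equally sized 2-D feature maps."""
    names = list(feature_maps)
    total = sum(weights[n] for n in names)
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    return [[sum(weights[n] * feature_maps[n][y][x] for n in names) / total
             for x in range(cols)]
            for y in range(rows)]


def most_salient(att):
    """(row, col) of the largest value in the attention map."""
    cells = [(y, x) for y in range(len(att)) for x in range(len(att[0]))]
    return max(cells, key=lambda p: att[p[0]][p[1]])


maps = {
    "color":  [[0, 200], [0, 0]],    # bright object at the top right
    "skin":   [[0, 0], [150, 0]],    # skin-toned region at the bottom left
    "motion": [[0, 0], [0, 0]],
}

print(most_salient(attention_map(maps, {"color": 1, "skin": 1, "motion": 1})))  # (0, 1)
print(most_salient(attention_map(maps, {"color": 1, "skin": 3, "motion": 1})))  # (1, 0)
```

Raising the weight of one map shifts the selected location without changing any of the underlying feature values.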

A.  Task-based  influences  on  attention 

For a goal-achieving creature, the behavioral state should also bias what
the creature attends to next. For instance, when performing visual search,
humans seem to be able to preferentially select the output of one broadly
tuned channel per feature (e.g. “red” for color and “shallow” for
orientation if searching for red horizontal lines).

In our system these top-down, behavior-driven factors modulate the output of
the individual feature maps before they are summed to produce the bottom-up
contribution. This process selectively enhances or suppresses the
contribution of certain features, but does not alter the underlying raw
saliency of a stimulus. To implement this, the bottom-up results of each
feature map are passed through a filter (effectively a gain). The value of
each gain is determined by the active behavior. For instance, as shown in
Figure 7, the skin tone gain is enhanced when the seek people behavior is
active and is suppressed when the avoid people behavior is active.
Similarly, the color gain is enhanced when the seek toys behavior is active,
and suppressed when the avoid toys behavior is active. Whenever the engage
people or engage toys behaviors are active, the face and color gains are
restored to their default values, respectively.

Fig. 7. Schematic of behaviors relevant to attention. The activation of a
particular behavior depends on both perceptual factors and motivation
factors. The perceptual factors come from post-attentive processing of the
target stimulus into behaviorally relevant percepts. The drives within the
motivation system have an indirect influence on attention by influencing the
behavioral context. The behaviors at Level 1 of the behavior system directly
manipulate the gains of the attention system to benefit their goals. Through
behavior arbitration, only one of these behaviors is active at any time.
These behaviors are further elaborated in deeper levels of the behavior
system.


Fig.  8.  Effect  of  gain  adjustment  on  looking  preference.  Circles 
correspond  to  fixation  points,  sampled  at  one  second  intervals.  On 
the  left,  the  gain  of  the  skin  tone  filter  is  higher.  The  robot  spends 
more  time  looking  at  the  face  in  the  scene  (86%  face,  14%  block). 
This  bias  occurs  despite  the  fact  that  the  face  is  dwarfed  by  the  block 
in  the  visual  scene.  On  the  right,  the  gain  of  the  color  saliency  filter 
is  higher.  The  robot  now  spends  more  time  looking  at  the  brightly 
colored  block  (28%  face,  72%  block). 

These  modulated  feature  maps  are  then  summed  to  com¬ 
pute  the  overall  attention  activation  map,  thus  biasing  at¬ 
tention  in  a  way  that  facilitates  achieving  the  goal  of  the 
active  behavior.  For  example,  if  the  robot  is  searching  for 
social  stimuli,  it  becomes  sensitive  to  skin  tone  and  less  sen¬ 
sitive  to  color.  Behaviorally,  the  robot  may  encounter  toys 
in  its  search,  but  will  continue  until  a  skin  toned  stimulus  is 
found (often a person’s face). Figure 8 shows the results of
two such experiments. The left figure shows a looking preference
for a person, despite a lesser “raw” saliency, when the
robot is seeking out people. Conversely, when the robot is
actively searching for a toy, it demonstrates a looking preference
for the colorful block despite the dominant presence
of a person’s face in the visual field.
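
The gain mechanism described above can be sketched as follows. This is a minimal illustration, not Kismet's implementation: the numeric gain values, the dictionary names, and the toy 1x2 "scene" are all assumptions made for the example. Only the direction of each adjustment (enhanced or suppressed per behavior) comes from the text.

```python
# Illustrative gains only; the paper does not give numeric values.
DEFAULT_GAINS = {"skin_tone": 1.0, "color": 1.0, "motion": 1.0}

# Hypothetical per-behavior overrides of the default gains.
BEHAVIOR_GAINS = {
    "seek_people": {"skin_tone": 2.0, "color": 0.5},
    "avoid_people": {"skin_tone": 0.5},
    "seek_toys": {"color": 2.0},
    "avoid_toys": {"color": 0.5},
    "engage_people": {},  # gains restored to their defaults
    "engage_toys": {},
}


def gains_for(behavior):
    """Return the per-feature gains selected by the active behavior."""
    gains = dict(DEFAULT_GAINS)
    gains.update(BEHAVIOR_GAINS[behavior])
    return gains


def activation_map(feature_maps, behavior):
    """Scale each feature map by its gain, then sum the maps pixel by pixel."""
    gains = gains_for(behavior)
    first = next(iter(feature_maps.values()))
    total = [[0.0] * len(first[0]) for _ in first]
    for name, fmap in feature_maps.items():
        for r, row in enumerate(fmap):
            for c, value in enumerate(row):
                total[r][c] += gains[name] * value
    return total


def most_salient(activation):
    """Return the (row, column) of the largest activation."""
    return max(
        ((r, c) for r, row in enumerate(activation) for c in range(len(row))),
        key=lambda rc: activation[rc[0]][rc[1]],
    )


# Toy 1x2 scene: a face at column 0 and a brightly colored block at column 1.
scene = {
    "skin_tone": [[0.4, 0.0]],
    "color": [[0.0, 0.6]],
    "motion": [[0.0, 0.0]],
}
looks_at_when_seeking_people = most_salient(activation_map(scene, "seek_people"))
looks_at_when_seeking_toys = most_salient(activation_map(scene, "seek_toys"))
```

In this sketch the raw feature maps are never modified; only their weighted sum changes, which mirrors the statement that the gains do not alter the underlying raw saliency of a stimulus.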

B.  Habituation  effects 

To  build  a  believable  creature,  the  attention  system 
must  also  implement  habituation  effects.  Infants  respond 
strongly  to  novel  stimuli,  but  soon  habituate  and  respond 
less  as  familiarity  increases.  This  acts  both  to  keep  the 
infant  from  being  continually  fascinated  with  any  single 
object  and  to  force  the  caretaker  to  continually  engage  the 
infant  with  slightly  new  and  interesting  interactions.  For 
a  robot,  a  habituation  mechanism  removes  the  effects  of 
highly  salient  background  objects  that  are  not  currently  in¬ 
volved  in  direct  interactions  as  well  as  placing  requirements 
on  the  caretaker  to  maintain  interaction  with  slightly  novel 
stimulation. 

To  implement  habituation  effects,  a  habituation  filter  is 
applied  to  the  activation  map  over  the  location  currently 
being  attended  to.  The  habituation  filter  effectively  decays 
the  activation  level  of  the  location  currently  being  attended 
to,  making  other  locations  of  lesser  activation  bias  atten¬ 
tion  more  strongly. 
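
One way to picture the habituation filter is as a per-location weight that decays while a location is attended. The sketch below is an assumption-laden toy: the multiplicative `decay`, the additive `recovery` for unattended locations, and the one-dimensional list of locations are all hypothetical choices, since the text does not specify the filter's form.

```python
def simulate_fixations(activation, steps, decay=0.8, recovery=0.1):
    """Pick the most active location each step while habituating to it.

    activation: raw activation per location (held constant here).
    decay: hypothetical factor applied to the attended location's weight.
    recovery: hypothetical amount by which unattended weights recover.
    """
    weights = [1.0] * len(activation)
    fixations = []
    for _ in range(steps):
        effective = [a * w for a, w in zip(activation, weights)]
        target = max(range(len(effective)), key=effective.__getitem__)
        fixations.append(target)
        for i in range(len(weights)):
            if i == target:
                weights[i] *= decay  # habituate to the attended location
            else:
                weights[i] = min(1.0, weights[i] + recovery)
    return fixations


# Location 0 is more salient, but attention eventually shifts to location 1.
fixations = simulate_fixations([1.0, 0.7], steps=6)
```

With these illustrative parameters, attention starts on the more salient location and then alternates with the less salient one; with `decay=1.0` (no habituation) it would never leave location 0.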

C.  Consistency  of  attention 

In the presence of objects of similar salience, it is useful
to be able to commit attention to one of the objects for a period
of time. This gives time for post-attentive processing
to  be  carried  out  on  the  object,  and  for  downstream  pro¬ 
cesses  to  organize  themselves  around  the  object.  As  soon 
as  a  decision  is  made  that  the  object  is  not  behaviorally 
relevant  (for  example,  it  may  lack  eyes,  which  are  searched 
for  post-attentively),  attention  can  be  withdrawn  from  it 
and  visual  search  may  continue.  Committing  to  an  object 
is  also  useful  for  behaviors  that  need  to  be  atomically  ap¬ 
plied  to  a  target  (for  example,  a  calling  behavior  where  the 
robot  needs  to  stay  looking  at  the  person  it  is  calling). 

To  allow  such  commitment,  the  attention  system  is  aug¬ 
mented  with  a  tracker.  The  tracker  follows  a  target  in 
the  visual  field,  using  simple  correlation  between  succes¬ 
sive  frames.  Usually  changes  in  the  tracker  target  will  be 
reflected  in  movements  of  the  robot’s  eyes,  unless  this  is 
behaviorally  inappropriate.  If  the  tracker  loses  the  target, 
it  has  a  very  good  chance  of  being  able  to  reacquire  it  from 
the  attention  system. 
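
A frame-to-frame tracker of this kind can be sketched as a small patch search. The example below is hypothetical in its details: it scores candidate positions with a sum of absolute differences (a simple stand-in for a correlation score), and the patch `size` and search `radius` are made-up parameters. The `radius` plays the role of a maximum displacement per frame over which the target is searched for.

```python
def patch(frame, top, left, size):
    """Cut a size x size patch out of a frame given as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def track(prev_frame, next_frame, top, left, size=2, radius=2):
    """Find where the patch at (top, left) in prev_frame moved to in next_frame."""
    template = patch(prev_frame, top, left, size)
    height, width = len(next_frame), len(next_frame[0])
    best = None
    for dt in range(-radius, radius + 1):
        for dl in range(-radius, radius + 1):
            t, l = top + dt, left + dl
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate window falls outside the image
            score = sad(template, patch(next_frame, t, l, size))
            if best is None or score < best[0]:
                best = (score, t, l)
    return best[1], best[2]


def blank(n):
    return [[0] * n for _ in range(n)]


def paint(frame, top, left):
    """Draw a small textured target into the frame."""
    for r, row in enumerate([[9, 8], [7, 6]]):
        for c, value in enumerate(row):
            frame[top + r][left + c] = value
    return frame


frame_a = paint(blank(6), 1, 1)
frame_b = paint(blank(6), 2, 3)  # target moved down 1 and right 2
new_position = track(frame_a, frame_b, 1, 1)
```

A target that moves farther than `radius` between frames would fall outside the search window; recovering it would then depend on the attention system, as described above.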

D.  Experiments  with  directing  attention 

The overall attention system runs at 20 Hz on several 400
MHz processors. In Section VII-A, we presented Kismet’s
looking preference results with respect to directing its attention
to task-relevant stimuli. In this section, we examine
how easy it is for people to direct Kismet’s attention
to a specific target stimulus, and to determine when they
have been successful in doing so.

Three  naive  subjects  were  invited  to  interact  with 
Kismet.  The  subjects  ranged  in  age  from  25  to  28  years 
old.  All  used  computers  frequently  but  were  not  computer 
scientists  by  training.  All  interactions  were  video-recorded. 
The  robot’s  attention  gains  were  set  to  their  default  values
so  that  there  would  be  no  strong  preference  for  one  saliency 
feature  over  another.  The  subjects  were  asked  to  direct  the 
robot’s  attention  to  each  of  the  target  stimuli.  There  were 
seven  target  stimuli  used  in  the  study.  Three  were  satu¬ 
rated  color  stimuli,  three  were  skin-toned  stimuli,  and  the 
last  was  a  pure  motion  stimulus.  Each  target  stimulus  was 
used  more  than  once  per  subject.  These  are  listed  below: 

•  A  highly  saturated  colorful  block 

•  A  bright  yellow  stuffed  dinosaur  with  multi-color  spines 

•  A  bright  green  cylinder 

•  A  bright  pink  cup  (which  is  actually  detected  by  the  skin 
tone  feature  map) 

•  The  person’s  face 

•  The  person’s  hand 

•  A  black  and  white  plush  cow  (which  is  only  salient  when 
moving) 

The  video  was  later  analyzed  to  determine  which  cues 
the  subjects  used  to  attract  the  robot’s  attention,  which 
cues  they  used  to  determine  when  they  had  been  success¬ 
ful,  and  the  length  of  time  required  to  do  so.  They  were 
also  interviewed  at  the  end  of  the  session  about  which  cues 
they  used,  which  cues  they  read,  and  about  how  long  they 
thought  it  took  to  direct  the  robot’s  attention.  The  results 
are  summarized  in  Table  I. 


stimulus            stimulus              presentations   average
category                                                  time (s)
-------------------------------------------------------------------
color &             yellow dinosaur       8               8.5
movement            multi-colored block   8               6.5
                    green cylinder        8               6.0
motion only         b/w cow               8               5.0
skin-toned &        pink cup              8               6.5
movement            hand                  8               5.0
                    face                  8               3.5
Total                                     56              5.8

Commonly used cues: motion across center line; shaking motion;
bringing target close to robot.

Commonly read cues: eye behavior, especially tracking; facial
expression, especially raised eyebrows; body posture, especially
forward lean or withdraw.

TABLE  I 

Summary  from  attention  manipulation  interactions. 
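
As a quick consistency check on Table I, the overall mean can be recomputed from the per-stimulus averages. This is only a sketch: it assumes eight presentations per stimulus, as listed, and uses the rounded per-stimulus averages, so it can only agree with the reported 5.8 seconds to within rounding.

```python
# Per-stimulus average times in seconds, copied from Table I.
average_time_s = {
    "yellow dinosaur": 8.5,
    "multi-colored block": 6.5,
    "green cylinder": 6.0,
    "b/w cow": 5.0,
    "pink cup": 6.5,
    "hand": 5.0,
    "face": 3.5,
}
presentations_per_stimulus = 8

total_presentations = presentations_per_stimulus * len(average_time_s)
# Every stimulus has the same number of presentations, so the weighted
# mean over presentations equals the plain mean of the row averages.
overall_mean_s = (
    sum(t * presentations_per_stimulus for t in average_time_s.values())
    / total_presentations
)
```

The recomputed value is just under 5.9 seconds, which is in line with the table's total row given that the row averages are themselves rounded.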


To  attract  the  robot’s  attention,  the  most  frequently 
used  cues  include  bringing  the  target  close  and  in  front  of 
the  robot’s  face,  shaking  the  object  of  interest,  or  moving 
it  slowly  across  the  centerline  of  the  robot’s  face.  Each  cue 
increases  the  saliency  of  a  stimulus  by  making  it  appear 
larger  in  the  visual  field,  or  by  supplementing  the  color 
or  skin-tone  cue  with  motion.  Note  that  there  was  an  in¬ 
herent  competition  between  the  saliency  of  the  target  and 
the  subject’s  own  face  as  both  could  be  visible  from  the 
wide  FoV  camera.  If  the  subject  did  not  try  to  direct  the 
robot’s  attention  to  the  target,  the  robot  tended  to  look  at 
the  subject’s  face. 


The  subjects  also  effortlessly  determined  when  they  had 
successfully  re-directed  the  robot’s  gaze.  Interestingly,  it  is 
not  sufficient  for  the  robot  to  orient  to  the  target.  People 
look  for  a  change  in  visual  behavior,  from  ballistic  orienta¬ 
tion  movements  to  smooth  pursuit  movements,  before  con¬ 
cluding  that  they  had  successfully  re-directed  the  robot’s 
attention.  All  subjects  reported  that  eye  movement  was 
the  most  relevant  cue  to  determine  if  they  had  successfully 
directed  the  robot’s  attention.  They  all  reported  that  it 
was  easy  to  direct  the  robot’s  attention  to  the  desired  tar¬ 
get.  They  estimated  the  mean  time  to  direct  the  robot’s 
attention  at  5  to  10  seconds.  This  turns  out  to  be  the  case; 
the  mean  time  over  all  trials  and  all  targets  is  5.8  seconds. 

VIII.  Post-attentive  processing 

Once  the  attention  system  has  selected  regions  of  the  vi¬ 
sual  field  that  are  potentially  behaviorally  relevant,  more 
intensive  computation  can  be  applied  to  these  regions  than 
could  be  applied  across  the  whole  field.  Searching  for  eyes 
is  one  such  task.  Locating  eyes  is  important  to  us  for  engag¬ 
ing  in  eye  contact,  and  as  a  reference  point  for  interpreting 
facial  movements  and  expressions.  We  currently  search  for 
eyes  after  the  robot  directs  its  gaze  to  a  locus  of  attention, 
so  that  a  relatively  high  resolution  image  of  the  area  being 
searched  is  available  from  the  foveal  cameras  (see  Figure 
9).  Once  the  target  of  interest  has  been  selected,  we  also 
estimate  its  proximity  to  the  robot  using  a  stereo  match 
between the two central wide cameras. Proximity is an important
factor for interaction, as things closer to the robot should
be of greater interest. It is also useful for interaction at
a distance, such as when a person is standing too far away for
face-to-face interaction but is close enough to be beckoned closer.
Clearly  the  relevant  behavior  (beckoning  or  playing)  is  de¬ 
pendent  on  the  proximity  of  the  human  to  the  robot. 


Fig.  9.  Eyes  are  searched  for  within  a  restricted  part  of  the  robot’s 
field  of  view.  The  eye  detector  actually  looks  for  the  region  between 
the  eyes.  It  has  adequate  performance  over  a  limited  range  of  dis¬ 
tances  and  face  orientations. 

Eye-detection  in  a  real-time,  robotic  domain  is  compu¬ 
tationally  expensive  and  prone  to  error  due  to  the  large 
variance  in  head  posture,  lighting  conditions  and  feature
scales.  We  developed  an  approach  based  on  successive  fea¬ 
ture  extraction,  combined  with  some  inherent  domain  con¬ 
straints,  to  achieve  a  robust  and  fast  eye-detection  system 
for Kismet. First, a set of feature filters is applied successively
to the image in increasing feature granularity. This
serves  to  reduce  the  computational  overhead  while  main¬ 
taining  a  robust  system.  The  successive  filter  stages  are: 

•  Detect  skin  colored  patches  in  the  image  (abort  if  this 
does  not  pass  above  threshold). 

• Scan the image for ovals and characterize each oval’s skin
tone to identify a potential face.

•  Extract  a  sub-image  of  the  oval  and  run  a  ratio  template 
[13],  [14]  over  it  for  candidate  eye  locations. 

• For each candidate eye location, run a pixel-based multilayer
perceptron on the region. The perceptron is previously
trained to recognize shading patterns characteristic
of the eyes and bridge of the nose.

At each stage, the feature filter reduces the set of possible
eye locations passed on from the previous stage.
This allows the eye detector to run in real time on
a 400 MHz PC. The methodology assumes that the lighting
conditions allow the eyes to be distinguished as dark
regions  surrounded  by  highlights  of  the  temples  and  the 
bridge  of  the  nose,  that  human  eyes  are  largely  surrounded 
by  regions  of  skin  color,  that  the  head  is  only  moderately 
rotated,  that  the  eyes  are  reasonably  horizontal,  and  that 
people  are  within  interaction  distance  from  the  robot  (3  to 
10  feet). 
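
The successive filter stages listed above follow a cascade pattern: cheap tests run first and prune candidates before more expensive tests run. The sketch below shows only that control flow. The candidate records, score names, and 0.5 thresholds are hypothetical placeholders standing in for the real skin-color, oval, ratio-template, and perceptron stages.

```python
def run_cascade(candidates, stages):
    """Apply (name, keep) stages in order, pruning candidates at each stage.

    Returns the surviving candidates and how many candidates each stage
    had to evaluate. Stops early once nothing survives.
    """
    evaluated = {}
    for name, keep in stages:
        evaluated[name] = len(candidates)
        candidates = [c for c in candidates if keep(c)]
        if not candidates:
            break  # abort: no region passed this stage
    return candidates, evaluated


# Hypothetical precomputed scores standing in for real image analysis.
regions = [
    {"id": "a", "skin": 0.9, "oval": 0.8, "ratio": 0.7, "mlp": 0.9},
    {"id": "b", "skin": 0.2, "oval": 0.9, "ratio": 0.9, "mlp": 0.9},
    {"id": "c", "skin": 0.8, "oval": 0.3, "ratio": 0.9, "mlp": 0.9},
    {"id": "d", "skin": 0.7, "oval": 0.9, "ratio": 0.6, "mlp": 0.1},
]

# Stages ordered from cheapest to most expensive, as in the list above.
stages = [
    ("skin_patch", lambda r: r["skin"] > 0.5),
    ("oval", lambda r: r["oval"] > 0.5),
    ("ratio_template", lambda r: r["ratio"] > 0.5),
    ("perceptron", lambda r: r["mlp"] > 0.5),
]

eyes, evaluated = run_cascade(regions, stages)
```

In this toy run the most expensive stage is evaluated on only two of the four regions, which is the source of the computational savings the text describes.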


IX.  Eye  movements 

Kismet’s  eyes  periodically  saccade  to  new  targets  chosen 
by  an  attention  system,  tracking  them  smoothly  if  they 
move  and  the  robot  wishes  to  engage  them.  Vergence  eye 
movements are more challenging to implement in a social
setting,  since  errors  in  disjunctive  eye  movements  can  give 
the  eyes  a  disturbing  appearance  of  moving  independently. 
Errors  in  conjunctive  movements  have  a  much  smaller  im¬ 
pact  on  an  observer,  since  the  eyes  clearly  move  in  lock- 
step.  A  crude  approximation  of  the  opto-kinetic  reflex  is 
rolled  into  our  implementation  of  smooth  pursuit.  An  ana¬ 
logue  of  the  vestibular-ocular  reflex  has  been  developed  us¬ 
ing  a  3-axis  inertial  sensor,  but  has  yet  to  be  implemented 
on  Kismet  (it  currently  runs  on  other  humanoid  robots  in 
our lab). Kismet uses an efferent copy mechanism to compensate
the eyes for movements of the head. An overview of
the oculomotor control system is shown in Figure 10.

The  attention  system  operates  on  the  view  from  the  cen¬ 
tral  camera  (see  Figure  2).  A  transformation  is  needed  to 
convert  pixel  coordinates  in  images  from  this  camera  into 
position  setpoints  for  the  eye  motors.  This  transforma¬ 
tion  in  general  requires  the  distance  to  the  target  to  be 
known,  since  objects  in  many  locations  will  project  to  the 
same  point  in  a  single  image  (see  Figure  11).  Distance  esti¬ 
mates  are  often  noisy,  which  is  problematic  if  the  goal  is  to 
center  the  target  exactly  in  the  eyes.  In  practice,  it  is  usu¬ 
ally  enough  to  get  the  target  within  the  field  of  view  of  the 
foveal cameras in the eyes. Clearly, the narrower the field of
view of these cameras is, the more accurately the distance
to the object needs to be known. Other crucial factors are
the distance between the wide and foveal cameras, and the
closest distance at which the robot will need to interact
with objects. These constraints determined the physical
distribution of Kismet’s cameras and choice of lenses. The
central location of the wide camera places it as close as
possible to the foveal cameras. It also has the advantage
that moving the head to center a target as seen in the central
camera will in fact truly orient the head towards that
target. For cameras in other locations, accuracy of orientation
would be limited by the accuracy of the measurement
of distance.

Fig. 10. Organization of Kismet’s eye/neck motor control. Many
cross level influences have been omitted. The modules in gray are
not active in the results presented in this paper.




Fig.  11.  Without  distance  information,  knowing  the  position  of  a 
target  in  the  wide  camera  only  identifies  a  ray  along  which  the  object 
must  lie,  and  does  not  uniquely  identify  its  location.  If  the  cameras 
are  close  to  each  other  relative  to  the  closest  distance  the  object  is 
expected  to  be  at,  the  foveal  cameras  can  be  rotated  to  bring  the 
object  within  their  narrow  field  of  view  without  needing  an  accurate 
estimate  of  its  distance.  If  the  cameras  are  far  apart,  or  the  field 
of  view  is  very  narrow,  the  minimum  distance  the  object  can  be  at 
becomes  large. 
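
The trade-off in Figure 11 can be made concrete with a simplified geometric model. Assume, purely for illustration, that the foveal camera is offset from the wide camera by a baseline perpendicular to the target ray, and that the foveal camera is pointed parallel to that ray (as if the target were infinitely far away). The target then appears off the foveal axis by an angle of atan(baseline / distance), and it stays in view as long as that angle is at most half the foveal field of view. The baseline and field-of-view numbers below are hypothetical, not Kismet's.

```python
import math


def angular_offset_deg(baseline_m, distance_m):
    """Angle between the foveal axis and the target under the model above."""
    return math.degrees(math.atan2(baseline_m, distance_m))


def min_distance_m(baseline_m, foveal_fov_deg):
    """Closest target distance that still falls inside the foveal field of view."""
    return baseline_m / math.tan(math.radians(foveal_fov_deg) / 2.0)


# Hypothetical numbers: 5 cm between cameras, 20 degree foveal field of view.
d_min = min_distance_m(0.05, 20.0)
```

Under this model, widening the baseline or narrowing the field of view both increase the minimum distance, which matches the qualitative statement in the caption of Figure 11.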


Higher-level  influences  modulate  eye  and  neck  move¬ 
ments  in  a  number  of  ways.  As  already  discussed,  mod¬ 
ifications  to  weights  in  the  attention  system  translate  to 
changes of the locus of attention about which eye movements
are organized. The regime used to control the eyes
and  neck  is  available  as  a  set  of  primitives  to  higher-level 
modules.  Regimes  include  low-commitment  search,  high- 
commitment  engagement,  avoidance,  sustained  gaze,  and 
deliberate  gaze  breaking.  The  primitive  percepts  gener¬ 
ated  by  this  level  include  a  characterization  of  the  most 
salient  regions  of  the  image  in  terms  of  the  feature  maps, 
an  extended  characterization  of  the  tracked  region  in  terms 
of  the  results  of  post-attentive  processing  (eye  detection, 
distance  estimation),  and  signals  related  to  undesired  con¬ 
ditions,  such  as  a  looming  object,  or  an  object  moving  at 
speeds  the  tracker  finds  difficult  to  keep  up  with. 

X.  Visual  motor  skills 

Given  the  current  task  (as  dictated  by  the  behavior  sys¬ 
tem),  the  motor  skills  level  is  responsible  for  figuring  out 
how  to  move  the  actuators  to  carry  out  the  stated  goal. 
Often  this  requires  coordination  between  multiple  motor 
modalities  (speech,  body  posture,  facial  display,  and  gaze 
control).  Requests  for  these  modalities  can  originate  from 
the  top-down  (e.g.  from  the  emotion  system  or  behavior 
system),  as  well  as  from  the  bottom-up  (the  vocal  system 
requesting  lip  and  jaw  movements  for  lip  synching).  Hence, 
the  motor  skills  level  must  address  the  issue  of  servicing  the 
motor  requests  of  different  systems  across  the  different  mo¬ 
tor  resources. 

Furthermore,  it  often  requires  a  sequence  of  coordinated 
motor  movements  to  satisfy  a  goal.  Each  motor  movement 
is  a  primitive  (or  a  combination  of  primitives)  from  one  of 
the  base  motor  systems  (the  vocal  system,  the  oculomotor 
system,  etc.).  Each  of  these  coordinated  series  of  motor 
primitives  is  called  a  skill,  and  each  skill  is  implemented 
as  a  finite  state  machine  (FSM).  Each  motor  skill  encodes 
knowledge  of  how  to  move  from  one  motor  state  to  the 
next,  where  each  sequence  is  designed  to  bring  the  robot 
closer  to  the  current  goal.  The  motor  skills  level  must 
arbitrate  among  the  many  different  FSMs,  selecting  the  one 
to  become  active  based  on  the  active  goal.  This  decision 
process is straightforward since there is an FSM tailored
for  each  task  of  the  behavior  system. 

Many  skills  can  be  thought  of  as  a  fixed  action  pattern 
(FAP)  as  conceptualized  by  early  ethologists.  Each  FAP 
consists  of  two  components,  the  action  component  and  the 
taxis  (or  orienting)  component.  For  Kismet,  FAPs  often 
correspond  to  communicative  gestures  where  the  action 
component  corresponds  to  the  facial  gesture,  and  the  taxis 
component  (to  whom  the  gesture  is  directed)  is  controlled 
by  gaze.  People  seem  to  intuitively  understand  that  when 
Kismet  makes  eye  contact  with  them,  they  are  the  locus 
of  Kismet’s  attention  and  the  robot’s  behavior  is  organized 
about  them.  This  places  the  person  in  a  state  of  action 
readiness  where  they  are  poised  to  respond  to  Kismet’s 
gestures. 

A simple example of a motor skill is Kismet’s “calling”
FAP. When the current task is to bring a person into a
good interaction distance, the motor skill system activates
the calling FSM. The taxis component of the FAP issues
a hold gaze request to the oculomotor system. This serves
to maintain the robot’s gaze on the person to be hailed. In
the  first  state  of  the  gestural  component,  Kismet  leans  its 
body  toward  the  person  (a  request  to  the  body  posture 
motor  system).  This  strengthens  the  person’s  perception 
that  the  robot  has  taken  a  particular  interest  in  them.  The 
ears  also  begin  to  waggle  exuberantly  (creating  a  significant 
amount  of  motion  and  noise)  which  further  attracts  the 
person’s  attention  to  the  robot  (a  request  to  the  face  mo¬ 
tor  system).  In  addition,  Kismet  vocalizes  excitedly  which 
is  perceived  as  an  initiation  of  engagement.  At  the  com¬ 
pletion  of  this  gesture,  the  FSM  transitions  to  the  second 
state.  In  this  state,  the  robot  “sits  back”  and  waits  for  a 
bit  with  an  expecting  expression  (ears  slightly  perked,  eyes 
slightly widened, and brows raised). If the person has not
already approached the robot, the approach is likely to occur
during this “anticipation” phase. If the person does not approach
within  the  allotted  time  period,  the  FSM  transitions  to  the 
third  state  in  which  the  face  relaxes,  the  robot  maintains 
a  neutral  posture,  and  gaze  fixation  is  released.  At  this 
point,  the  robot  is  likely  to  shift  gaze.  As  long  as  this  FSM 
is  active  (determined  by  the  behavior  system),  the  hailing 
cycle  repeats.  It  can  be  interrupted  at  any  state  transition 
by  the  activation  of  another  FSM  (such  as  the  “greeting” 
FSM  when  the  person  has  approached). 
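
The calling skill described above can be sketched as a three-state machine. The state names, the motor request strings, and the `still_active` callback are hypothetical; they stand in for the requests sent to the oculomotor, posture, face, and vocal systems and for the behavior system's decision to keep this FSM active.

```python
from enum import Enum


class CallState(Enum):
    CALL = 1        # lean forward, waggle ears, vocalize
    ANTICIPATE = 2  # sit back and wait with an expectant expression
    RELEASE = 3     # relax face, neutral posture, release gaze


# Hypothetical (motor system, request) pairs issued on entering each state.
REQUESTS = {
    CallState.CALL: [
        ("oculomotor", "hold_gaze"),
        ("posture", "lean_toward_person"),
        ("face", "waggle_ears"),
        ("vocal", "excited_vocalization"),
    ],
    CallState.ANTICIPATE: [
        ("posture", "sit_back"),
        ("face", "expectant_expression"),
    ],
    CallState.RELEASE: [
        ("face", "relax"),
        ("posture", "neutral"),
        ("oculomotor", "release_gaze"),
    ],
}

# The hailing cycle repeats for as long as the FSM stays active.
NEXT_STATE = {
    CallState.CALL: CallState.ANTICIPATE,
    CallState.ANTICIPATE: CallState.RELEASE,
    CallState.RELEASE: CallState.CALL,
}


def run_calling_fsm(still_active, max_steps):
    """Step through the cycle, checking for interruption at each transition."""
    state = CallState.CALL
    visited = []
    for step in range(max_steps):
        visited.append(state)
        if not still_active(step):
            break  # another FSM (for example, greeting) takes over
        state = NEXT_STATE[state]
    return visited


full_cycle = run_calling_fsm(lambda step: True, max_steps=4)
interrupted = run_calling_fsm(lambda step: step < 1, max_steps=10)
```

Checking `still_active` only between states reflects the statement that the skill can be interrupted at any state transition, rather than in the middle of a gesture.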

We  now  move  up  another  layer  of  abstraction,  to  the 
behavior  level  in  the  hierarchy  that  was  shown  in  Figure  3. 

XI.  Visual  behavior 

The behavior level is responsible for establishing the current
task for the robot through arbitrating among Kismet’s
goal-achieving behaviors. By doing so, the observed behavior
should be relevant, appropriately persistent, and opportunistic.
Both the current environmental conditions (as
characterized by high-level perceptual releasers) and
motivational factors (emotion processes and homeostatic
regulation) contribute to this decision process.

Interaction  of  the  behavior  level  with  the  social  level  oc¬ 
curs  through  the  world  as  determined  by  the  nature  of  the 
interaction  between  Kismet  and  the  human.  As  the  hu¬ 
man  responds  to  Kismet,  the  robot’s  perceptual  conditions 
change.  This  can  activate  a  different  behavior,  whose  goal 
is physically carried out by the underlying motor systems. The
human  observes  the  robot’s  ensuing  response  and  shapes 
their  reply  accordingly. 

Interaction  of  the  behavior  level  with  the  motor  skills 
level  also  occurs  through  the  world.  For  instance,  if  Kismet 
is  looking  for  a  bright  toy,  then  the  seek  toy  behavior  is 
active.  This  task  is  passed  to  the  underlying  motor  skills 
which  carry  out  the  search.  The  act  of  scanning  the  envi¬ 
ronment  brings  new  perceptions  to  Kismet’s  field  of  view. 
If  a  toy  is  found,  then  the  seek  toy  behavior  is  successful 
and  released.  At  this  point,  the  perceptual  conditions  for 
engaging the toy are relevant and the engage toy behaviors
become active. A new set of motor skills becomes active
to track and smoothly pursue the toy.
A.  Social  level

Eye  movements  have  communicative  value.  As  discussed 
previously,  they  indicate  the  robot’s  locus  of  attention.  The 
robot’s  degree  of  engagement  can  also  be  conveyed,  to  com¬ 
municate  how  strongly  the  robot’s  behavior  is  organized 
around  what  it  is  currently  looking  at.  If  the  robot’s  eyes 
flick  about  from  place  to  place  without  resting,  that  in¬ 
dicates  a  low  level  of  engagement,  appropriate  to  a  visual 
search  behavior.  Prolonged  fixation  with  smooth  pursuit 
and  orientation  of  the  head  towards  the  target  conveys 
a  much  greater  level  of  engagement,  suggesting  that  the 
robot’s  behavior  is  very  strongly  organized  about  the  locus 
of  attention. 


[Figure 12 diagram labels: person backs off; person draws closer;
comfortable interaction; speed; too fast and too close: threat
response; too fast: irritation response.]

Fig.  12.  Regulating  interaction.  People  too  distant  to  be  seen  clearly 
are  called  closer;  if  they  come  too  close,  the  robot  signals  discomfort 
and  withdraws.  The  withdrawal  moves  the  robot  back  somewhat 
physically,  but  is  more  effective  in  signaling  to  the  human  to  back  off. 
Toys  or  people  that  move  too  rapidly  cause  irritation. 


Eye  movements  are  the  most  obvious  and  direct  motor 
actions  that  support  visual  perception.  But  they  are  by 
no  means  the  only  ones.  Postural  shifts  and  fixed  action 
patterns  involving  the  entire  robot  also  have  an  important 
role.  Kismet  has  a  number  of  coordinated  motor  actions 
designed  to  deal  with  various  limitations  of  Kismet’s  vi¬ 
sual  perception  (see  Figure  12).  For  example,  if  a  person 
is  visible,  but  is  too  distant  for  their  face  to  be  imaged  at 
adequate  resolution,  Kismet  engages  in  a  calling  behavior 
to  summon  the  person  closer.  People  who  come  too  close 
to  the  robot  also  cause  difficulties  for  the  cameras  with 
narrow  fields  of  view,  since  only  a  small  part  of  a  face  may 
be  visible.  In  this  circumstance,  a  withdrawal  response  is 
invoked,  where  Kismet  draws  back  physically  from  the  per¬ 
son.  This  behavior,  by  itself,  aids  the  cameras  somewhat 
by  increasing  the  distance  between  Kismet  and  the  human. 
But  the  behavior  can  have  a  secondary  and  greater  effect 
through social amplification: for a human close to Kismet,
a  withdrawal  response  is  a  strong  social  cue  to  back  away, 
since  it  is  analogous  to  the  human  response  to  invasions  of 
“personal  space.” 

Similar  kinds  of  behavior  can  be  used  to  support  the  vi¬ 
sual  perception  of  objects.  If  an  object  is  too  close,  Kismet 


can  lean  away  from  it;  if  it  is  too  far  away,  Kismet  can  crane 
its  neck  towards  it.  Again,  in  a  social  context,  such  actions 
have  power  beyond  their  immediate  physical  consequences. 
A  human,  reading  intent  into  the  robot’s  actions,  may  am¬ 
plify  those  actions.  For  example,  neck-craning  towards  a 
toy  may  be  interpreted  as  interest  in  that  toy,  resulting 
in  the  human  bringing  the  toy  closer  to  the  robot.  An¬ 
other  limitation  of  the  visual  system  is  how  quickly  it  can 
track  moving  objects.  If  objects  or  people  move  at  excessive 
speeds, Kismet has difficulty tracking them continuously.
To  bias  people  away  from  excessively  boisterous  behavior 
in  their  own  movements  or  in  the  movement  of  objects  they 
manipulate,  Kismet  shows  irritation  when  its  tracker  is  at 
the  limits  of  its  ability.  These  limits  are  either  physical 
(the  maximum  rate  at  which  the  eyes  and  neck  move),  or 
computational  (the  maximum  displacement  per  frame  from 
the  cameras  over  which  a  target  is  searched  for). 

Such regulatory mechanisms play roles in more complex
social interactions, such as conversational turn-taking.
Here control of gaze direction is important for regulating
conversation rate [15]. In general, people are likely
to  glance  aside  when  they  begin  their  turn,  and  make  eye 
contact  when  they  are  prepared  to  relinquish  their  turn 
and  await  a  response.  Blinks  occur  most  frequently  at  the 
end  of  an  utterance.  These  and  other  cues  allow  Kismet 
to  influence  the  flow  of  conversation  to  the  advantage  of 
its  auditory  processing.  The  visual-motor  system  can  also 
be  driven  by  the  requirements  of  a  nominally  unrelated 
sensory  modality,  just  as  behaviors  that  seem  completely 
orthogonal  to  vision  (such  as  ear-wiggling  during  the  call 
behavior  to  attract  a  person’s  attention)  are  nevertheless 
recruited  for  the  purposes  of  regulation.  These  mecha¬ 
nisms  also  help  protect  the  robot.  Objects  that  suddenly 
appear  close  to  the  robot  trigger  a  looming  reflex,  caus¬ 
ing  the  robot  to  quickly  withdraw  and  appear  startled.  If 
the  event  is  repeated,  the  response  quickly  habituates  and 
the  robot  simply  appears  annoyed,  since  its  best  strategy 
for  ending  these  repetitions  is  to  clearly  signal  that  they 
are  undesirable.  Similarly,  rapidly  moving  objects  close  to 
the  robot  are  threatening  and  trigger  an  escape  response. 
These  mechanisms  are  all  designed  to  elicit  natural  and  in¬ 
tuitive  responses  from  humans,  without  any  special  train¬ 
ing.  But  even  without  these  carefully  crafted  mechanisms, 
it  is  often  clear  to  a  human  when  Kismet’s  perception  is 
failing,  and  what  corrective  action  would  help,  because  the 
robot’s  perception  is  reflected  in  behavior  in  a  familiar  way. 
Inferences  made  based  on  our  human  preconceptions  are 
actually  likely  to  work. 
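
The regulatory responses described in this section and in Figure 12 can be summarized as a mapping from target distance and speed to a social display. The sketch below is illustrative only: the function name, the threshold values, and the units are assumptions, and the real system arbitrates among behaviors rather than applying a fixed rule.

```python
def regulatory_response(distance_ft, speed, near_ft=3.0, far_ft=10.0, max_speed=1.0):
    """Choose a display from target distance and speed.

    near_ft, far_ft, and max_speed are hypothetical thresholds; speed is in
    arbitrary units relative to what the tracker can follow.
    """
    too_close = distance_ft < near_ft
    too_fast = speed > max_speed
    if too_fast and too_close:
        return "threat response"  # rapid motion close to the robot
    if too_fast:
        return "irritation"       # tracker at the limit of its ability
    if too_close:
        return "withdraw"         # signals the person to back off
    if distance_ft > far_ft:
        return "call"             # summon the person closer
    return "engage"               # comfortable interaction


responses = [
    regulatory_response(6.0, 0.2),
    regulatory_response(15.0, 0.2),
    regulatory_response(1.0, 0.2),
    regulatory_response(6.0, 2.0),
    regulatory_response(1.0, 2.0),
]
```

Each non-default response is a cue that invites the human to correct the situation (approach, back off, or slow down), which is the social amplification effect the text describes.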

B.  Evidence  of  Social  Amplification 

To  evaluate  the  social  implications  of  Kismet’s  behav¬ 
ior,  we  invited  a  few  people  to  interact  with  the  robot  in  a 
free-form  exchange.  There  were  four  subjects  in  the  study, 
two  males  (one  adult  and  one  child)  and  two  females  (both 
adults).  They  ranged  in  age  from  12  to  28  years.  None  of 
the  subjects  were  affiliated  with  MIT.  All  had  substantial 
experience  with  computers.  None  of  the  subjects  had  any 
prior experience with Kismet. The child had prior experience
with a variety of interactive toys. Each subject interacted
with the robot for 20 to 30 minutes. All exchanges
were  video  recorded  for  further  analysis. 

We  analyzed  the  video  for  evidence  of  social  amplifica¬ 
tion.  Namely,  did  people  read  Kismet’s  cues  and  did  they 
respond  to  them  in  a  manner  that  benefited  the  robot’s 
perceptual processing or its behavior? We found several
classes  of  interactions  where  the  robot  displayed  social  cues 
and  successfully  regulated  the  exchange. 

B.1  Establishing  a  Personal  Space

The  strongest  evidence  of  social  amplification  was  appar¬ 
ent  in  cases  where  people  came  within  very  close  proximity 
of  Kismet.  In  numerous  instances  the  subjects  would  bring 
their  face  very  close  to  the  robot’s  face.  The  robot  would 
withdraw,  shrinking  backwards,  perhaps  with  an  annoyed 
expression  on  its  face.  In  some  cases  the  robot  would  also 
issue  a  vocalization  with  an  expression  of  disgust.  In  one 
instance,  the  subject  accidentally  came  too  close  and  the 
robot  withdrew  without  exhibiting  any  signs  of  annoyance. 
The  subject  immediately  queried,  “Am  I  too  close  to  you? 
I  can  back  up,”  and  moved  back  to  put  a  bit  more  space 
between  himself  and  the  robot.  In  another  instance,  a  dif¬ 
ferent  subject  intentionally  put  his  face  very  close  to  the 
robot’s  face  to  explore  the  response.  The  robot  withdrew 
while  displaying  full  annoyance  in  both  face  and  voice.  The 
subject  immediately  pushed  backwards,  rolling  the  chair 
across  the  floor  to  put  about  an  additional  three  feet  be¬ 
tween  himself  and  the  robot,  and  promptly  apologized  to 
the  robot. 

Overall, across different subjects, the robot successfully
established a personal space. This benefits the robot’s visual
processing by keeping people at a distance where the visual
system can detect eyes more robustly. This behavioral response
was added to the robot’s repertoire because previous
interactions with naive subjects showed that the robot was not
granted any personal space. This can be attributed to “baby
movements” where people tend to get extremely close to infants,
for instance.

B.2 Luring People to a Good Interaction Distance

People seem responsive to Kismet’s calling behavior. When a
person is close enough for the robot to perceive his or her
presence, but too far away for face-to-face exchange, the robot
issues this social display to bring the person closer. The most
distinguishing features of the display are craning the neck
forward in the direction of the person, wiggling the ears with
large amplitude, and vocalizing with an excited affect. The
function of the display is to lure people into an interaction
distance that benefits the vision system. This behavior is not
often witnessed, as most subjects simply pull up a chair in
front of the robot and remain seated at a typical face-to-face
interaction distance.
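
The two distance-regulating displays described in B.1 and B.2
can be summarized as a simple selection rule. The following is
only a minimal sketch, not the robot's implementation: the
function name, the Display labels, and all distance thresholds
are hypothetical, since the text gives no numeric values.

```python
from enum import Enum


class Display(Enum):
    WITHDRAW = "withdraw"  # person is inside the robot's personal space
    ENGAGE = "engage"      # person is at a face-to-face distance
    CALL = "call"          # person is perceived but too far away
    NONE = "none"          # nobody is perceived


# Hypothetical thresholds in metres (assumptions for illustration only).
TOO_CLOSE_M = 0.4
TOO_FAR_M = 1.5
MAX_RANGE_M = 4.0


def select_display(distance_m):
    """Pick a social display from the estimated distance to a person.

    distance_m is None when no person is perceived.
    """
    if distance_m is None or distance_m > MAX_RANGE_M:
        return Display.NONE
    if distance_m < TOO_CLOSE_M:
        return Display.WITHDRAW
    if distance_m > TOO_FAR_M:
        return Display.CALL
    return Display.ENGAGE
```

Under these assumed thresholds, a person seated at about one
metre would elicit neither the withdrawal nor the calling
display, which matches the observation that most seated subjects
rarely triggered the calling behavior.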

The youngest subject took the liberty of exploring different
interaction ranges, however. Over the course of about fifteen
minutes he would alternately approach the robot to a normal
face-to-face distance, move very close to the robot (invading
its personal space), and back away from the robot. Upon the
first appearance of the calling response, the experimenter
queried the subject about the robot’s behavior. The subject
interpreted the display as the robot wanting to play, and he
approached the robot. At the end of the subject’s investigation,
the experimenter queried him about the farther interaction
distances. The subject responded that when he was farther from
Kismet, the robot would lean forward. He also noted that the
robot had a harder time looking at his face when he was farther
back. In general, he interpreted the leaning behavior as the
robot’s attempt to initiate an exchange with him. We have
noticed from earlier interactions (with other people unfamiliar
with the robot) that a few people have not immediately
understood this display as a “calling” behavior. The display is
flamboyant enough, however, to arouse their interest in
approaching the robot.

XII. Limitations and extensions

There are a number of ways the current implementation could be
improved and expanded upon. Some of these recommendations
involve supplementing the existing framework; others involve
integrating this system into a larger framework.

Kismet’s visual perceptual world only consists of what is in
view of the cameras. Ultimately, the robot should be able to
construct an ego-centered saliency map of interaction space. In
this representation, the robot could keep track of where
interesting things are located, even if they are not currently
in view. Human infants engage in social referencing with their
caregiver at a very young age. If some event occurs that the
infant is unsure about, the infant will look to the caregiver’s
face for an affective assessment. The infant will use this
assessment to organize its behavior. For instance, if the
caregiver looks frightened, the infant may become distressed and
not probe further. If the caregiver looks pleased and
encouraging, the infant is likely to continue exploring. Kismet,
likewise, will encounter many situations that it was not
explicitly programmed to evaluate. However, if the robot can
engage in social referencing, it can look to the human for the
affective assessment and use it to bias learning and to organize
subsequent behavior. Chances are, the event in question and the
human’s face will not be in view at the same time. Hence, a
representation of where interesting things are in ego-centered
interaction space is an important resource.
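
One way such an ego-centered representation might be organized
is sketched below. This is an illustrative assumption rather
than a description of Kismet: the class names, the pan/tilt
coordinates, and the exponential decay of remembered saliency
(with a hypothetical half-life) are all introduced here for the
example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    label: str
    pan_deg: float      # direction relative to the robot's body
    tilt_deg: float
    saliency: float     # saliency when last observed
    last_seen_s: float  # time of the last observation


class EgoMap:
    """Remembers where salient things were, even when out of view."""

    def __init__(self, half_life_s=10.0):
        # Assumed: remembered saliency halves every half_life_s seconds.
        self.half_life_s = half_life_s
        self.items = {}

    def observe(self, label, pan_deg, tilt_deg, saliency, now_s):
        """Record or refresh an item seen by the cameras."""
        self.items[label] = Item(label, pan_deg, tilt_deg, saliency, now_s)

    def current_saliency(self, label, now_s):
        """Saliency of a remembered item, decayed by its age."""
        item = self.items[label]
        age_s = now_s - item.last_seen_s
        return item.saliency * 0.5 ** (age_s / self.half_life_s)

    def gaze_target(self, label):
        """Pan/tilt to which the robot would turn to re-acquire the item."""
        item = self.items[label]
        return item.pan_deg, item.tilt_deg
```

With a map of this kind, the robot could turn from an uncertain
event back to the remembered direction of the caregiver's face
to perform social referencing, even though the face is outside
the current camera view.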

The attention system could be extended by adding new feature
maps. A depth map from stereo would be very useful; currently
distance is only computed post-attentively. Another interesting
feature map to incorporate into the system would be edge
orientation. Wolfe and Treisman, among others, argue in favor of
edge orientation as a bottom-up feature map in humans.
Currently, Kismet has no shape metrics to help it distinguish
objects from each other (such as its toy block from its toy
dinosaur). Adding features to support this is an important
extension to the existing implementation.
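
To illustrate how an additional feature map could enter the
attention system, the sketch below assumes that the saliency
map is a gain-weighted sum of equally sized feature maps. The
function names, the map names, and the gain values are
hypothetical; the maps are plain nested lists to keep the
example self-contained.

```python
def combine_maps(maps, gains):
    """Gain-weighted sum of equally sized 2-D feature maps.

    maps:  dict of name -> list of rows of floats
    gains: dict of name -> weight (missing names get weight 0)
    """
    names = list(maps)
    rows = len(maps[names[0]])
    cols = len(maps[names[0]][0])
    out = [[0.0] * cols for _ in range(rows)]
    for name in names:
        gain = gains.get(name, 0.0)
        for r in range(rows):
            for c in range(cols):
                out[r][c] += gain * maps[name][r][c]
    return out


def most_salient(saliency):
    """Return (row, col) of the largest value in the saliency map."""
    best = max(
        (value, r, c)
        for r, row in enumerate(saliency)
        for c, value in enumerate(row)
    )
    return best[1], best[2]
```

Adding an edge-orientation or depth map then amounts to adding
one more entry to the dictionary of maps and choosing its gain;
with a gain of zero the new map has no effect on where attention
is directed.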

There are no auditory bottom-up contributions. A sound
localization feature map would be a nice multi-modal extension.
Currently, Kismet assumes that the most salient person is the
one who is talking to it. Often there are multiple people
talking around and to the robot. It is important that the robot
knows who is addressing it and when. Sound localization would be
of great benefit here.
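
As a rough illustration of how a sound bearing could be used to
decide who is speaking, the sketch below matches an estimated
sound direction against the visually estimated directions of the
people in view. The function name, the bearing representation in
degrees, and the tolerance value are assumptions made for this
example only.

```python
def pick_speaker(people_bearings_deg, sound_bearing_deg, tolerance_deg=15.0):
    """Return the person whose bearing best matches the sound source.

    people_bearings_deg: dict of name -> bearing in degrees
    Returns None when nobody lies within tolerance_deg of the sound.
    """
    best_name, best_err = None, tolerance_deg
    for name, bearing in people_bearings_deg.items():
        # Smallest absolute angular difference, wrapped to [0, 180].
        err = abs((bearing - sound_bearing_deg + 180.0) % 360.0 - 180.0)
        if err <= best_err:
            best_name, best_err = name, err
    return best_name
```

When no visible person lies near the sound direction, the
function returns None, which would correspond to speech coming
from someone outside the robot's field of view.
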

XIII. Conclusions

Motor control for a social robot poses challenges beyond issues
of stability and accuracy. Motor actions will be perceived by
human observers as semantically rich, regardless of whether the
imputed meaning is intended or not. This can be a powerful
resource for facilitating natural interactions between robot and
human, and places constraints on the robot’s physical appearance
and movement. It allows the robot to be readable, that is, to
make its behavioral intent and motivational state transparent at
an intuitive level to those it interacts with. It allows the
robot to regulate its interactions to suit its perceptual and
motor capabilities, again in an intuitive way with which humans
naturally cooperate. These social constraints give the robot
leverage over the world that extends far beyond its physical
competence, through social amplification of its perceived
intent. If properly designed, the robot’s visual behaviors can
be matched to human expectations and allow both robot and human
to participate in natural and intuitive social interactions.